General Bias/Variance Decomposition with Target Independent Variance of Error Functions Derived from the Exponential Family of Distributions

Abstract

An important theoretical tool in machine learning is the bias/variance decomposition of the generalization error. It was introduced for the mean square error in [3]. The bias/variance decomposition includes the concept of the average predictor. The bias is the error of the average predictor and the systematic part of the generalization error, while the variability around the average predictor is the variance. We present a large group of error functions with the same desirable properties as the bias/variance decomposition in [3]. The error functions are derived from the exponential family of distributions via the statistical deviance measure. We prove that this family of error functions contains all error functions decomposable in that manner. We state the connection between the bias/variance decomposition and the ambiguity decomposition [7] and present a useful approximation of ambiguity that is quadratic in the ensemble coefficients.

1 Notation and problem domain

The problem domain of this paper is finding the functional relationship between output and input based on an example set of target-input pairs $S = \{(t_1, \vec{x}_1), \ldots, (t_n, \vec{x}_n)\}$. To make this a relevant problem, it is assumed that the set is generated with noise from a function $r(\vec{x})$. We wish to find a predictor $f(\vec{w}, \vec{x})$ that is as close as possible to $r(\vec{x})$. The vector $\vec{w}$ refers to the parameters that describe the predictor, e.g. the weights in a neural network. Furthermore, we are interested in the situation where we have an ensemble of predictors characterized by a distribution $F$, which is independent of the noise distribution. The mean operator is denoted $\langle \cdot \rangle_F$. The set of predictors can be finite or infinite. We will generally look at only one input point, so for notational convenience we omit the dependency of functions on the input; we also omit the parameters of the predictors. The inaccuracy or error of a predictor is measured with an error function $E(t, f)$.

2 Bias/variance decomposition

If $S$ is noisy, it is not guaranteed that $r(\vec{x}_i) = t_i$ for all $i$, so it is not optimal to find a predictor with $f(\vec{x}_i) = t_i$ for all $i$. If $S$ is noise-free and we find a predictor with $f(\vec{x}_i) = t_i$ for all $i$, the predictor can still differ from the function $r(\vec{x})$ at all other points. In both cases, by the principle of Occam's Razor, the class of possible predictors should be restricted, e.g. by limiting the number of weights in a neural network. This raises an important question: just how large a class of predictors should be used? If the class is too small, the predictors are too simple and cannot fit the target function. On the other hand, if the class is too large, the predictors can become too complex and overfit. The two cases correspond to two different kinds of errors: bias and variance.

To fully understand the difference between the errors we look at an ensemble of predictors. The mean of the predictors is the average predictor. The error of the average predictor expresses the systematic error of the predictors, i.e. the bias, while the mean of the error between the predictors and the average predictor expresses the stochastic error, i.e. the variance. Generally, both kinds of errors will be made by an ensemble of predictors. We would like to be able to split the mean of the generalization error into a bias and a variance term: Error = Bias + Variance. This is the bias/variance decomposition. It was introduced in [3] for the mean square error, $E_{MSE}(t, f) = \frac{1}{2}(t - f)^2$. The average predictor is $\bar{f} = \langle f \rangle_F$, and the mean generalization error is $\langle E_{MSE}(t, f) \rangle_F$. We have

$$\langle E_{MSE}(t, f) \rangle_F = E_{MSE}(t, \bar{f}) + \langle E_{MSE}(\bar{f}, f) \rangle_F, \qquad (1)$$

where $E_{MSE}(t, \bar{f})$ is the bias and $\langle E_{MSE}(\bar{f}, f) \rangle_F$ is the variance.
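To make the decomposition (1) concrete, the following is a minimal numerical sketch (not part of the paper; the predictor values and the target are made up for illustration). It checks Error = Bias + Variance for the mean square error at a single input point, with a finite ensemble standing in for the distribution $F$.

```python
import numpy as np

# Hypothetical ensemble of predictor outputs at a single input point,
# standing in for the distribution F; the target value is also made up.
f = np.array([0.8, 1.1, 0.9, 1.3, 1.0])
t = 1.2

def e_mse(t, f):
    """Mean square error E_MSE(t, f) = (t - f)^2 / 2."""
    return 0.5 * (t - f) ** 2

f_bar = f.mean()                      # average predictor <f>_F
error = e_mse(t, f).mean()            # mean generalization error <E_MSE(t, f)>_F
bias = e_mse(t, f_bar)                # error of the average predictor
variance = e_mse(f_bar, f).mean()     # mean error around the average predictor

assert np.isclose(error, bias + variance)   # decomposition (1)
print(error, bias, variance)
```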
The mean square error and the corresponding bias/variance decomposition have a number of crucial properties: the error is zero and minimal when the predictor equals the target; the bias depends on the predictors only through the average predictor; the variance does not depend on the target and is minimal for the average predictor. The last property is more general than $\bar{f} = \langle f \rangle_F$. The mathematical definition is $\bar{f} = \arg\min_{t'} \langle E(t', f) \rangle_F$. For the mean square error, $\arg\min_{t'} \langle E(t', f) \rangle_F$ is equal to $\langle f \rangle_F$. The above requirements can be formulated mathematically:

R1: $\arg\min_f E(t, f) = t$.

R2: $E(t, t) = 0$.

R3: The bias/variance decomposition

$$\langle E(t, f) \rangle_F = E(t, \bar{f}) + \langle E(\bar{f}, f) \rangle_F, \qquad (2)$$

where

$$\bar{f} = \arg\min_{t'} \langle E(t', f) \rangle_F. \qquad (3)$$

R1 and R2 can be derived from R3.

3 General bias/variance decomposition

Not all error functions obey the requirements R1-R3. For example, the 0-1 loss error function is impossible to decompose as in R3 (see e.g. [2, 6]). A natural question is: which error functions obey the requirements? We prove that only the error functions corresponding to the deviance of one-parameter members of the exponential family of distributions obey R3. The deviance error function for members of the exponential family is presented in section 4. A sketch of the proof that they are the only error functions obeying the requirements R1-R3 can be found in section 5. In section 6 some examples of deviance error functions are given. In section 7 the connection between the bias/variance decomposition and the ambiguity decomposition is made, and an approximation of ambiguity is presented which is quadratic in the ensemble coefficients.

4 Deviance error function and the exponential family

It is well known that the mean square error can be interpreted as the negative log-likelihood under a Gaussian noise model. The density of the Gaussian (Normal) distribution, with standard deviation equal to one, is given by

$$p(t \mid f) = \frac{1}{\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}(t - f)^2\right].$$

The mean square error is connected to the density by

$$E(t, f) = -\log p(t \mid f) + \log p(t \mid t),$$

where the last term is added to ensure requirement R2. This is also called the deviance [8]. The Normal distribution is a special case of the one-parameter exponential family. The general form is [1]

$$p(t \mid f) = \exp\!\left[\phi(f) T(t) + d(f) + S(t)\right], \qquad (4)$$

where $\phi$ is the canonical link function, $T$ is the sufficient statistic, and $d$ is a normalization term. The function $S$ plays no role in the following. The density $p(t \mid f)$ yields the deviance error function

$$E(t, f) = [\phi(t) - \phi(f)]\, T(t) + d(t) - d(f). \qquad (5)$$

To ensure $\arg\min_f E(t, f) = t$ we also need the constraint

$$\phi'(y) T(y) + d'(y) = 0. \qquad (6)$$

The Normal distribution has $\phi(f) = f$, $T(t) = t$, and $d(f) = -\frac{1}{2} f^2$ for standard deviation equal to one; the constraint (6) is obeyed. The function $d$ is determined by the constraint (6), so the deviance error function is completely determined by the canonical link function and the sufficient statistic $T$. The error function in (5) upholds R1-R3: R1 follows from the constraint (6), and requirement R2 is obeyed because of the definition of the deviance error. For the error functions in (5) we have the corollary

$$\bar{f} = \arg\min_{t'} \langle E(t', f) \rangle_F = \phi^{-1}\!\left(\langle \phi(f) \rangle_F\right).$$

With this definition of the average predictor the decomposition (2) is easily verified.
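As a concrete illustration (not from the paper; the Poisson member is chosen as an example and the ensemble values are made up), the sketch below builds the deviance error (5) from a canonical link and sufficient statistic and checks the decomposition (2) with the average predictor $\bar{f} = \phi^{-1}(\langle \phi(f) \rangle_F)$.

```python
import numpy as np

# Poisson member of the exponential family, used here purely as an example:
# canonical link phi(f) = log f, sufficient statistic T(t) = t, d(f) = -f,
# which satisfies constraint (6): phi'(y) T(y) + d'(y) = (1/y) * y - 1 = 0.
def phi(f): return np.log(f)
def phi_inv(y): return np.exp(y)
def T(t): return t
def d(f): return -f

def deviance(t, f):
    """Deviance error function, eq. (5): [phi(t) - phi(f)] T(t) + d(t) - d(f)."""
    return (phi(t) - phi(f)) * T(t) + d(t) - d(f)

# Hypothetical ensemble of (positive) predictions at one input point, and a target.
f = np.array([2.0, 3.5, 2.8, 4.1])
t = 3.0

f_bar = phi_inv(phi(f).mean())        # average predictor phi^{-1}(<phi(f)>_F)
error = deviance(t, f).mean()         # <E(t, f)>_F
bias = deviance(t, f_bar)             # E(t, f_bar)
variance = deviance(f_bar, f).mean()  # <E(f_bar, f)>_F, independent of the target

assert np.isclose(error, bias + variance)   # decomposition (2)
```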
The variance is by definition independent of the target, and is given by

$$\mathrm{var}(f) = \langle E(\bar{f}, f) \rangle_F = \langle d(\bar{f}) - d(f) \rangle_F. \qquad (7)$$

5 Error functions with "nice" bias/variance decompositions

As mentioned, the error functions of the form (5) are the only error functions obeying the requirements R1-R3. Here we sketch the proof. We will show that $\frac{\partial^2 E(t, f)}{\partial t \, \partial f} = A_1(f) A_2(t)$. This suffices, since it shows that the error function must be of the form canonical link times sufficient statistic, plus terms depending only on either $t$ or $f$. The crux of the proof is that when the distribution over $f$ is changed slightly, the change in the average predictor affects $\frac{\partial^2 E(t, \bar{f})}{\partial t \, \partial \bar{f}}$ multiplicatively and independently of $t$.

The derivative with respect to $t$ of the bias/variance decomposition (2) is

$$\frac{\partial \langle E(t, f) \rangle_F}{\partial t} = \frac{\partial E(t, \bar{f})}{\partial t}$$

for all distributions $F$. Let $\rho$ denote the distribution with density $\rho(f)$ and average predictor $\bar{f}$. Let $\delta(f)$ be a function that integrates to zero over the domain of $f$. Then $\rho(f \mid \epsilon) = \rho(f) + \epsilon \delta(f)$ is a density with average predictor $\bar{f}(\epsilon)$. We have that

$$\frac{\partial \langle E(t, f) \rangle_{\rho(\epsilon)}}{\partial t} = \frac{\partial \langle E(t, f) \rangle_{\rho}}{\partial t} + \epsilon \int \! df \, \delta(f) \, \frac{\partial E(t, f)}{\partial t}.$$

Differentiation with respect to $\epsilon$ yields

$$\frac{\partial^2 E(t, \bar{f}(\epsilon))}{\partial t \, \partial \bar{f}} \, \frac{\partial \bar{f}(\epsilon)}{\partial \epsilon} = \int \! df \, \delta(f) \, \frac{\partial E(t, f)}{\partial t}.$$

In the limit $\epsilon \to 0$, $\frac{\partial^2 E(t, \bar{f}(\epsilon))}{\partial t \, \partial \bar{f}}$ becomes $\frac{\partial^2 E(t, \bar{f})}{\partial t \, \partial \bar{f}}$. The right-hand side does not depend on $\bar{f}$, while $\frac{\partial \bar{f}(\epsilon)}{\partial \epsilon}$ does not depend on $t$. It must therefore be that $\frac{\partial^2 E(t, \bar{f})}{\partial t \, \partial \bar{f}} = A_1(\bar{f}) A_2(t)$.

6 Examples of error functions

We consider two special cases, linear sufficient statistic and linear canonical link, and show how they can be regarded as being "transposed". The common univariate distributions in the exponential family have a linear sufficient statistic, but generally a nonlinear canonical link. With $T(y) = y$, the constraint (6) becomes $d'(y) = -\phi'(y) y$, which is equivalent to $d(y) = -\phi(y) y + C(y)$, where $C$ is the anti-derivative of $\phi$. Table 1 gives an overview of some of the error functions with non-linear average predictors. Any constant factors are omitted.

Distribution        Error function                                           Domain
Normal              $\frac{1}{2}(f - t)^2$                                   $]-\infty, \infty[$
Poisson             $(f - t) + t \log\frac{t}{f}$                            $[0, \infty[$
Binomial            $t \log\frac{t}{f} + (1 - t) \log\frac{1 - t}{1 - f}$    $[0, 1]$
Gamma               $(\frac{t}{f} - 1) + \log\frac{f}{t}$                    $]0, \infty[$
Inverse Gaussian    $(f - t)^2 / (f^2 t)$                                    $]0, \infty[$

Table 1. Error functions with linear sufficient statistics.

As mentioned above, the defining functions are the sufficient statistic and the canonical link. By interchanging these two we find the transposed family of error functions. To ensure the constraint (6), the function $d$ in the transposed error function is set to

$$d_{\mathrm{TRANSPOSE}}(f) = -d(f) - \phi(f) T(f).$$

The error functions in table 1 have linear sufficient statistics, so the transposed error functions have linear canonical links. The average predictor is given by $\bar{f} = \phi^{-1}(\langle \phi(f) \rangle_F)$, so the transposed error functions have linear average predictors. Table 2 gives an overview of the transposed error functions with linear average predictors. Constant factors are omitted.

Distribution        Error function                                           Domain
Normal              $\frac{1}{2}(f - t)^2$                                   $]-\infty, \infty[$
Poisson             $(t - f) + f \log\frac{f}{t}$                            $[0, \infty[$
Binomial            $f \log\frac{f}{t} + (1 - f) \log\frac{1 - f}{1 - t}$    $[0, 1]$
Gamma               $(\frac{f}{t} - 1) + \log\frac{t}{f}$                    $]0, \infty[$
Inverse Gaussian    $(f - t)^2 / (t^2 f)$                                    $]0, \infty[$

Table 2. Error functions with linear canonical links.

Note that the transposed error functions have the predictor and the target interchanged compared to their counterparts. Also note that the Normal error function and its transposed counterpart are both the mean square error. The transposed Binomial error function is not very useful, since it is undefined for a target equal to one or zero.
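As a small numerical check of the relation between the two tables (not from the paper; the Poisson pair is picked as an example and the values are made up), the sketch below verifies that the transposed error is the original with predictor and target interchanged, and that each version decomposes as in (2) with its own average predictor: the geometric mean for Table 1 and the arithmetic mean for Table 2.

```python
import numpy as np

# Poisson entries from Table 1 and Table 2 (constant factors omitted).
def e_poisson(t, f):
    """Table 1: linear sufficient statistic, geometric-mean average predictor."""
    return (f - t) + t * np.log(t / f)

def e_poisson_transposed(t, f):
    """Table 2: linear canonical link, arithmetic-mean average predictor."""
    return (t - f) + f * np.log(f / t)

f = np.array([2.0, 3.5, 2.8, 4.1])   # hypothetical ensemble of predictions
t = 3.0                              # hypothetical target

# The transposed error is the original with predictor and target interchanged.
assert np.allclose(e_poisson_transposed(t, f), e_poisson(f, t))

# Table 1: decomposition (2) holds with the geometric mean (phi(f) = log f).
f_bar = np.exp(np.log(f).mean())
assert np.isclose(e_poisson(t, f).mean(),
                  e_poisson(t, f_bar) + e_poisson(f_bar, f).mean())

# Table 2: decomposition (2) holds with the arithmetic mean (linear link).
f_bar_t = f.mean()
assert np.isclose(e_poisson_transposed(t, f).mean(),
                  e_poisson_transposed(t, f_bar_t) + e_poisson_transposed(f_bar_t, f).mean())
```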
